WEWRA: An algorithm for Wrapper Verification
نویسندگان
چکیده
Web wrappers play an important role in extracting information from distributed web sources and subsequently in the integration of heterogeneous data. Changes in the layout of web sources typically break the wrapper, leading to erroneous extraction of infomation. Monitoring and repairing broken wrappers is an important hurdle for data integration, since it is an expensive and painful procedure. In this paper we present VEWRA, a new approach to wrapper verification, which improves the successful family of trainable content based methods. Compared to its predecessors, the new method aims to capture not only the syntactic patterns but the correlations that exist among them due to the underlying semantics of the extracted information. Experiments show that our method achieves excellent performance, being always better or equal than DATAPROG, the state-of-art related work.
منابع مشابه
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wr...
متن کاملFast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets
Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...
متن کاملRegression testing for wrapper maintenance
Recent work on Internet information integration ~sumes a library of wrappers, specialized information extraction procedures. Maintaining wrappers is difficult, because the formatting regularities on which they rely often change. The wrapper verification problem is to determine whether a wrapper is correct. Standard regression testing approaches are inappropriate, because both the formatting reg...
متن کاملData Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
متن کاملAssessment of a 2D EPID-based Dosimetry Algorithm for Pre-treatment and In-vivo Midplane Dose Verification
Introduction: The use of electronic portal imaging devices (EPIDs) is a method for the dosimetric verification of radiotherapy plans both pretreatment and in-vivo. The aim of this study was to test a 2D EPID-based dosimetry algorithm for dose verification of some plans inside a homogenous and anthropomorphic phantom and in-vivo, as well. Materials and Methods: </strong...
متن کامل